Abstract

We propose a variable selection procedure that incorporates prior constraint information into
the lasso. The proposed procedure combines sample and prior information, and selects significant
variables for the response in a narrower region in which the true parameters lie. This increases the
efficiency of choosing the true model correctly. The procedure can be executed by many
constrained quadratic programming methods, and the initial estimator can be found by least squares
or by a Monte Carlo method. The procedure also enjoys good theoretical properties. Moreover,
it applies not only to linear models but also to generalized linear models (GLM), Cox models,
quantile regression models and many others with the help of the least squares approximation (LSA)
of Wang and Leng (2007), which approximates these models by linear models.
The idea of combining sample and prior constraint information can also be used for other modified
lasso procedures. Some examples illustrate the idea of incorporating prior
constraint information in variable selection procedures.

Keywords: lasso; linear models; prior constraint information; sample information; variable
selection

1. Introduction 

In practice, many variables are included in an initial regression analysis, but a number of
them may not be significant for the response variable and should be excluded from the final model
in order to improve the accuracy of prediction and the ease of interpretation. Variable selection is
therefore fundamental in statistical modeling. The least absolute shrinkage and selection operator
(LASSO) (Tibshirani 1996) is a useful and well-studied approach to the problem of variable selection
(Knight and Fu 2000; Fan and Li 2001; Leng et al. 2006; Wang et al. 2007a; Yuan and Lin 2007). It
shrinks some coefficients and sets others to 0, and hence tries to retain the good features of both
subset selection and ridge regression. Moreover, the major advantage of the lasso is that it performs
parameter estimation and variable selection simultaneously. In particular, allowing an adaptive
amount of shrinkage for each regression coefficient results in an estimator that is as efficient as the
oracle (Zou 2006; Wang et al. 2007b; Wang and Leng 2007). For computational techniques, see
Osborne et al. (2000), Efron et al. (2004), Rosset (2004), Zhao and Yu (2004) and Park and Hastie
(2006).

Beyond the sample information, however, some prior constraint information on the parameters is
often available in variable selection or in the estimation of regression coefficients. Such constraints
can be expressed as $g(\beta) \le 0$, which covers both equalities and inequalities, where $g(\cdot)$ is a
$k$-dimensional vector of linear or nonlinear functions (see Rao and Toutenburg 1995; Silvapulle and
Sen 2005). In fact, the common simple order $\beta_1 \le \cdots \le \beta_p$, the tree order
$\beta_i \le \beta_p$ for $i = 1, \ldots, p-1$, the umbrella order
$\beta_1 \le \cdots \le \beta_l \ge \cdots \ge \beta_p$ and, more generally, $A\beta \le a$ are all special
cases of $g(\beta) \le 0$. All these constraints have very important applications in biomedical studies,
life science, econometrics and social research. For example, in many biomedical studies, the treatment
groups in a clinical trial may be formed according to increasing levels of dosage of a drug and the
severity of disease in patients. In econometrics, the homogeneity of degree zero of a demand equation
implies that the price and income elasticities add up to zero, whereas the negativity of the substitution
matrix in consumer demand theory requires that all latent roots of the substitution matrix be
nonpositive. Stahlecker (1987) shows a variety of examples from the field of economics (such as
input-output models), where the constraints for the parameters are so-called workability conditions of
the form $\beta_i > 0$ or $\beta_i \in (a_i, b_i)$ or $E(y_t \mid X) \le a_t$. The literature deals with
this problem under the generic term constrained least squares (see Judge and Takayama 1966; Dufour
1989; Geweke 1986; Moors and van Houwelingen 1987; Rao and Toutenburg 1995, p. 75). Dorfman and
McIntosh (2001) show that imposing the curvature conditions on a system of demand equations improves
the MSEs of the estimated elasticities by 2 to 50%, depending on the signal-to-noise ratio and the
sample size. Combining the sample and prior information effectively therefore increases the efficiency
of variable selection and parameter estimation, because the prior information identifies a narrower
region in which to select the variables and estimate the parameters.

This paper proposes a procedure that combines prior and sample information in the lasso, with the
aim of obtaining more accurate variable selection and parameter estimation. The idea of combining
prior constraint and sample information is illustrated by the black region in Figure 1. When some
prior information on the parameters is known, variable selection is carried out in the narrower black
region AEFD rather than in the wider region ABCD, which increases the efficiency of choosing
the true model correctly. Moreover, our procedure incorporating prior constraint information applies
not only to linear models but also to generalized linear models, Cox models and
quantile regression models with the help of the LSA of Wang and Leng (2007), which approximates
these models by linear models. In fact, prior constraint information can also be used
in other modified lasso procedures, e.g. the fused lasso of Tibshirani et al. (2005) and the modified
lasso procedure that allows an adaptive amount of shrinkage for each regression coefficient (Zou 2006;
Wang et al. 2007b; Wang and Leng 2007).

The paper is organized as follows. Section 2 introduces the variable selection procedure combining
sample and prior constraint information in the lasso and in other modified lasso procedures. The main
theoretical properties are discussed in Section 3. Section 4 discusses the degrees of freedom of the
lasso procedure incorporating prior constraint information. The proposed procedure is illustrated
by some examples in Section 5. Section 6 gives a short discussion.

2. Variable Selection Combining Sample and Prior Constraint Information into Lasso 

2.1 Variable Selection Combining Sample and Prior Constraint Information into Lasso in 
Linear Models 

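We first consider variable selection incorporating prior constraint information into the lasso in
linear models:

$$\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^T\beta)^2 \quad \text{subject to} \quad \sum_{h=1}^{p} |\beta_h| \le s \ \text{ and } \ g(\beta) \le 0,$$

or

$$\min_{\beta}\, (Y - X\beta)^T (Y - X\beta) \quad \text{subject to} \quad \sum_{h=1}^{p} |\beta_h| \le s \quad \text{and} \quad g(\beta) \le 0, \tag{1}$$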
where $Y = (y_1, \ldots, y_n)^T$, $X = (x_1^T, \ldots, x_n^T)^T$ and $g(\cdot)$ are linear or nonlinear functions. That is, the
modified lasso objective function is

$$\sum_{i=1}^{n} (y_i - x_i^T\beta)^2 + \lambda^{(1)} \sum_{j=1}^{p} |\beta_j| + (\lambda^{(2)})^T g(\beta),$$

where $\lambda^{(1)}$ and $\lambda^{(2)} = (\lambda^{(2)}_1, \ldots, \lambda^{(2)}_k)^T$ are tuning parameters. The tuning parameters can be obtained
by estimating the prediction error of the procedure incorporating prior constraint information
into the lasso by cross-validation (CV), as described in Chapter 17 of Efron and Tibshirani (1993), or by
generalized cross-validation (GCV). The prediction error of the prediction term $\hat\eta(X)$ is given
by

$$PE = E\{Y - \hat\eta(X)\}^2 .$$

Then the value of $s$ yielding the lowest estimated PE is selected.

In the following, we describe in detail how to choose the tuning parameter $s$ by CV; GCV can be
used similarly. $l$-fold CV, as the term is used here, splits the $n$ observations into a training set
of size $n - l$ and a test set of size $l$ (this scheme is also known as leave-$l$-out CV), and averages
the squared error on the left-out observations over all the possible ways of obtaining such a partition.
The advantage is that all the data can be used for training; none has to
be held back in a separate test set. Take $l = 1$ as an example. Let

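$$\hat\beta_s^{(-j)} = \arg\min_{\beta} \Bigl\{ \sum_{i \ne j} (y_i - x_i^T\beta)^2 \ \text{ subject to } \ \sum_{h=1}^{p} |\beta_h| \le s \ \text{ and } \ g(\beta) \le 0 \Bigr\}, \tag{2}$$

where $\hat\beta_s^{(-j)}$ is the estimate obtained on the training data $x_1, \ldots, x_{j-1}, x_{j+1}, \ldots, x_n$, for $j = 1, \ldots, n$, by the
procedure incorporating prior constraint information into the lasso. Let

$$PE_s = \sum_{j=1}^{n} \bigl(y_j - x_j^T \hat\beta_s^{(-j)}\bigr)^2$$

be the estimated prediction error of 1-fold CV given the tuning parameter $s$. The chosen tuning
parameter is then

$$\hat s = \arg\min_{s} PE_s = \arg\min_{s} \sum_{j=1}^{n} \bigl(y_j - x_j^T \hat\beta_s^{(-j)}\bigr)^2 ,$$

that is, $\hat s$ minimizes the estimated prediction error. The simultaneous parameter estimation and
variable selection incorporating prior constraint information is then given by

$$\hat\beta_{\hat s} = \arg\min_{\beta} \Bigl\{ \sum_{i=1}^{n} (y_i - x_i^T\beta)^2 \ \text{ subject to } \ \sum_{h=1}^{p} |\beta_h| \le \hat s \ \text{ and } \ g(\beta) \le 0 \Bigr\}. \tag{3}$$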
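
The leave-one-out choice of the tuning parameter described above can be sketched in a few lines of Python. This is only a minimal illustration under strong assumptions, not the authors' implementation: it takes a single covariate ($p = 1$) and the hypothetical prior constraint $g(\beta) = -\beta \le 0$, in which case the constrained problem has a closed-form solution (the least squares slope clipped to $[0, s]$). The function names `fit_constrained`, `loo_prediction_error` and `choose_s`, the grid of candidate values and the toy data are all introduced here for illustration.

```python
def fit_constrained(xs, ys, s):
    """Solve min sum (y - x*b)^2 subject to |b| <= s and -b <= 0 (p = 1).

    With one covariate the objective is a convex parabola in b, so the
    constrained minimiser is the least squares slope clipped to [0, s].
    """
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b_ols = sxy / sxx
    return min(max(b_ols, 0.0), s)


def loo_prediction_error(xs, ys, s):
    """Estimated prediction error PE_s: leave out one observation at a time."""
    pe = 0.0
    for j in range(len(xs)):
        x_train = xs[:j] + xs[j + 1:]
        y_train = ys[:j] + ys[j + 1:]
        b = fit_constrained(x_train, y_train, s)
        pe += (ys[j] - xs[j] * b) ** 2
    return pe


def choose_s(xs, ys, grid):
    """Return the grid value of s with the smallest estimated PE_s."""
    return min(grid, key=lambda s: loo_prediction_error(xs, ys, s))


# Toy data (hypothetical): y is roughly 2 * x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
grid = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]

s_hat = choose_s(xs, ys, grid)
beta_hat = fit_constrained(xs, ys, s_hat)  # final fit on all the data, as in (3)
```

For $p > 1$ or a general $g$, `fit_constrained` would have to be replaced by a constrained quadratic or nonlinear programming solver; the cross-validation loop itself would stay the same.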



Remark 1. (Algorithm) The most important step in obtaining $\hat\beta_{\hat s}$ is to solve the
constrained minimization problem. If there are no constraints on the parameters, many well-developed
procedures can be used to find the solution of

$$\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^T\beta)^2 \quad \text{subject to} \quad \sum_{h=1}^{p} |\beta_h| \le s,$$

for example quadratic programming (Tibshirani 1996), the shooting algorithm (Fu 1998), local
quadratic approximation (Fan and Li 2001) and least angle regression (LARS) (Efron et al. 2004).
When prior constraint information is present, these procedures cannot be applied directly to the
constrained problem, although they may be able to solve it after some modification; this is an
interesting topic for future work. In fact, many quadratic programming methods can be used to
find the solution of (1) (see Dantzig and Eaves 1974). A quadratic programming solver does not
generally return an exactly sparse solution; if a tolerance is set, parameter estimates whose absolute
values fall below it can be regarded as 0.
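
To make the quadratic programming remark concrete, the following sketch recasts the constrained problem as a standard quadratic program by splitting each coefficient into its positive and negative parts, the device used for the unconstrained lasso in Tibshirani (1996). It assumes that the prior constraints are linear, $g(\beta) = A\beta - a$; the symbols $\beta^+$, $\beta^-$ and $\mathbf{1}$ (a vector of ones) are introduced here for illustration only.

```latex
% Write beta = beta^+ - beta^-, with beta^+ >= 0 and beta^- >= 0 (2p variables).
\begin{aligned}
\min_{\beta^+,\,\beta^-}\quad
  & \bigl(Y - X(\beta^+ - \beta^-)\bigr)^T \bigl(Y - X(\beta^+ - \beta^-)\bigr) \\
\text{subject to}\quad
  & \mathbf{1}^T(\beta^+ + \beta^-) \le s, \\
  & A(\beta^+ - \beta^-) \le a, \\
  & \beta^+ \ge 0, \quad \beta^- \ge 0 .
\end{aligned}
```

The objective is quadratic and all constraints are linear, so any general-purpose quadratic programming method applies. When $g$ is nonlinear the same split still linearizes the $\ell_1$ constraint, but the problem becomes a general nonlinear program.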

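Remark 2. (Initial Estimator) The OLS estimator may be taken as the initial estimator. To obtain a
more accurate starting point, however, a Monte Carlo method can be used to find an initial
estimator for the constrained problems above. The objective of the optimization problem (1) can be written as

$$\begin{aligned}
(Y - X\beta)^T (Y - X\beta) &= \beta^T X^T X \beta - 2\beta^T X^T Y + Y^T Y \\
&= (\beta - \mu)^T \Sigma^{-1} (\beta - \mu) + Y^T \bigl(I - X(X^TX)^{-1}X^T\bigr) Y
\end{aligned}$$

with known $\mu_{p\times 1} = (X^TX)^{-1}X^TY$ and $\Sigma_{p\times p} = (X^TX)^{-1}$. That is,

$$\hat\beta_s = \arg\min_{\beta} \Bigl\{ (\beta - \mu)^T \Sigma^{-1} (\beta - \mu) \ \text{ subject to } \ \sum_{h=1}^{p} |\beta_h| \le s \ \text{ and } \ g(\beta) \le 0 \Bigr\}$$

or

$$\hat\beta_s = \arg\max_{\beta} \Bigl\{ -\tfrac{1}{2}\bigl[(\beta - \mu)^T \Sigma^{-1} (\beta - \mu) + \log(|\Sigma|)\bigr] \ \text{ subject to } \ \sum_{h=1}^{p} |\beta_h| \le s, \ g(\beta) \le 0 \Bigr\}, \tag{4}$$

where

$$l(\beta) = -\tfrac{1}{2} (\beta - \mu)^T \Sigma^{-1} (\beta - \mu) - \tfrac{1}{2} \log(|\Sigma|) \tag{5}$$

is, up to an additive constant, the log-density of $N(\mu, \Sigma)$ when $\beta$ is regarded as a random variable. Randomly draw $m = 100000$
samples $Z_1, \ldots, Z_m$ from $N(\mu, \Sigma)$, where $Z_j = (Z_{1j}, \ldots, Z_{pj})^T$ for $j = 1, \ldots, m$. Set $Z_{old}$ as the initial
estimator, which satisfies

$$Z_{old} = \arg\max_{j = 1, \ldots, m} \Bigl\{ l(Z_j) \ \text{ subject to } \ \sum_{h=1}^{p} |Z_{hj}| \le s \ \text{ and } \ g(Z_j) \le 0 \Bigr\}.$$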
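
The Monte Carlo search for an initial estimator can be sketched as follows. This is a minimal standard-library illustration under stated assumptions rather than the authors' code: the names `cholesky` and `monte_carlo_initial`, the example values of $\mu$, $\Sigma$ and $s$, and the constraint $\beta_2 \le \beta_1$ are hypothetical, and the number of draws is reduced from the $m = 100000$ used in the text to keep the example fast.

```python
import math
import random


def cholesky(sigma):
    """Lower-triangular L with L L^T = sigma (sigma symmetric positive definite)."""
    p = len(sigma)
    low = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1):
            acc = sigma[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low


def monte_carlo_initial(mu, sigma, s, g, m=20000, seed=0):
    """Best feasible draw from N(mu, sigma), or None if no draw is feasible.

    A draw z is feasible when sum |z_h| <= s and every component of g(z) is <= 0.
    Among feasible draws, the one with the largest normal log-density, i.e. the
    smallest value of (z - mu)^T sigma^{-1} (z - mu), is returned.
    """
    rng = random.Random(seed)
    p = len(mu)
    low = cholesky(sigma)
    best, best_q = None, math.inf
    for _ in range(m):
        e = [rng.gauss(0.0, 1.0) for _ in range(p)]
        # z = mu + L e has distribution N(mu, sigma).
        z = [mu[i] + sum(low[i][k] * e[k] for k in range(i + 1)) for i in range(p)]
        if sum(abs(v) for v in z) > s or any(c > 0 for c in g(z)):
            continue
        # Because z - mu = L e, the quadratic form reduces to e^T e.
        q = sum(v * v for v in e)
        if q < best_q:
            best, best_q = z, q
    return best


# Hypothetical inputs: mu and sigma would come from (X^T X)^{-1} X^T Y and (X^T X)^{-1}.
mu = [1.0, -0.5]
sigma = [[0.5, 0.1], [0.1, 0.3]]
order_constraint = lambda b: [b[1] - b[0]]  # g(beta) = beta_2 - beta_1 <= 0
z_old = monte_carlo_initial(mu, sigma, s=1.0, g=order_constraint)
```

The returned draw is only a feasible starting point; it would normally be refined by a constrained optimizer.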

2.2 Variable Selection Combining Sample and Prior Constraint Information into Other 
Modified Lassos 

The limitation of the lasso is that all the regression coefficients share the same amount of shrinkage,

$$\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^T\beta)^2 + \lambda \sum_{j=1}^{p} |\beta_j| .$$

Wang et al. (2007b) therefore extend the lasso to the modified lasso*
criterion, which allows different tuning parameters for different coefficients:

$$\sum_{i=1}^{n} (y_i - x_i^T\beta)^2 + \sum_{j=1}^{p} \lambda_j |\beta_j| .$$

In order to combine the sample and prior constraint information, the variable selection procedure
can be executed as

$$\min_{\beta} \sum_{i=1}^{n} (y_i - x_i^T\beta)^2 + \sum_{j=1}^{p} \lambda_j |\beta_j| + (\lambda^{(2)})^T g(\beta),$$

which not only uses the prior information but also overcomes the limitation of the traditional lasso
procedure.

Similarly, the prior constraint information can be incorporated into the fused lasso of Tibshirani
et al. (2005), which encourages sparsity of the coefficients and of their successive differences, i.e.
flatness of the coefficient profile $\beta_j$ as a function of $j$.

2.3 Variable Selection Combining Sample and Prior Constraint Information into Lasso in 
Nonlinear Models 

The proposed variable selection procedure cannot be applied directly to nonlinear models, e.g.
generalized linear models, Cox models and quantile regression models. With the help of
the LSA of Wang and Leng (2007), however, it can be used for these
nonlinear models as well. LSA regards

$$(\beta - \tilde\beta)^T \hat\Sigma^{-1} (\beta - \tilde\beta) \tag{6}$$

as the least squares approximation of the original loss $n^{-1}L_n(\beta)$, where $\tilde\beta$ is the unpenalized estimator
obtained by minimizing $L_n(\beta)$, $\hat\Sigma^{-1} = n^{-1}\ddot L_n(\tilde\beta)$ and $\ddot L_n(\cdot)$ is the second derivative of the loss
function $L_n(\cdot)$. The expression (6) has the same form as the log-density $l(\beta)$ in (5). So it is clear that the
lasso procedure incorporating prior constraint information can also be used for variable selection
in nonlinear models with the help of the least squares approximation.
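
The quadratic form used by LSA can be motivated by a second-order Taylor expansion of the loss around the unpenalized estimator. The following is a sketch of that standard argument, assuming that $L_n$ is twice differentiable; $\dot L_n$ denotes the gradient of $L_n$, a symbol introduced here.

```latex
\begin{aligned}
n^{-1} L_n(\beta)
 &\approx n^{-1} L_n(\tilde\beta)
   + n^{-1} \dot L_n(\tilde\beta)^T (\beta - \tilde\beta)
   + \tfrac{1}{2} (\beta - \tilde\beta)^T
     \bigl\{ n^{-1} \ddot L_n(\tilde\beta) \bigr\} (\beta - \tilde\beta) \\
 &= n^{-1} L_n(\tilde\beta)
   + \tfrac{1}{2} (\beta - \tilde\beta)^T \hat\Sigma^{-1} (\beta - \tilde\beta),
 \qquad \text{since } \dot L_n(\tilde\beta) = 0 .
\end{aligned}
```

The first term does not depend on $\beta$ and the factor $1/2$ can be absorbed into the tuning parameters, which leaves the quadratic form $(\beta - \tilde\beta)^T \hat\Sigma^{-1} (\beta - \tilde\beta)$.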

3. Some Theoretical Properties 

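In this section, we derive some theoretical results for the lasso combining the sample and prior
constraint information that are analogous to those for the lasso and the fused lasso (Knight and Fu
2000; Tibshirani et al. 2005). The penalized least squares criterion is

$$\sum_{i=1}^{n} (y_i - x_i^T\beta)^2 + \lambda_n^{(1)} \sum_{j=1}^{p} |\beta_j| + (\lambda_n^{(2)})^T g(\beta)$$

with $\beta = (\beta_1, \ldots, \beta_p)^T$ and $x_i = (x_{i1}, \ldots, x_{ip})^T$, and the Lagrange multipliers $\lambda_n^{(1)}$ and $\lambda_n^{(2)}$ are
functions of the sample size $n$. Let the optimal solution be $\hat\beta_n$.

For simplicity, we assume that $p$ is fixed as $n \to \infty$ and that the components of $g(\cdot)$ are differentiable convex functions.
The following theorem adequately illustrates the basic dynamics of the lasso combining sample
and prior constraint information.

Theorem 1. If $\lambda_n^{(l)}/\sqrt{n} \to \lambda_0^{(l)}$ $(l = 1, 2)$ and

$$C = \lim_{n \to \infty} \Bigl( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T \Bigr)$$

is non-singular, then

$$\sqrt{n}\,(\hat\beta_n - \beta) \to_d \arg\min_{u} V(u),$$

where

$$V(u) = -2u^T W + u^T C u + \lambda_0^{(1)} \sum_{j=1}^{p} \bigl[ u_j \,\mathrm{sgn}(\beta_j) I(\beta_j \ne 0) + |u_j| I(\beta_j = 0) \bigr] + (\lambda_0^{(2)})^T \frac{\partial g(\beta)}{\partial \beta^T}\, u$$

and $W$ has an $N(0, \sigma^2 C)$ distribution.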
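Proof. The criterion is

$$\sum_{i=1}^{n} (y_i - x_i^T\beta)^2 + \lambda_n^{(1)} \sum_{j=1}^{p} |\beta_j| + (\lambda_n^{(2)})^T g(\beta),$$

where $\lambda_n^{(1)}$ and $\lambda_n^{(2)}$ are functions of the sample size $n$. Define $V_n(u)$ by

$$V_n(u) = \sum_{i=1}^{n} \bigl\{ (\varepsilon_i - u^T x_i/\sqrt{n})^2 - \varepsilon_i^2 \bigr\} + \lambda_n^{(1)} \sum_{j=1}^{p} \bigl\{ |\beta_j + u_j/\sqrt{n}| - |\beta_j| \bigr\} + \bigl( g(\beta + u/\sqrt{n}) - g(\beta) \bigr)^T \lambda_n^{(2)}$$

with $u = (u_1, \ldots, u_p)^T$, and note that $V_n(u)$ is minimized at $\sqrt{n}\,(\hat\beta_n - \beta)$. First note that

$$\sum_{i=1}^{n} \bigl\{ (\varepsilon_i - u^T x_i/\sqrt{n})^2 - \varepsilon_i^2 \bigr\} \to_d -2u^T W + u^T C u$$

with finite dimensional convergence holding trivially, where

$$C = \lim_{n \to \infty} \Bigl( \frac{1}{n} \sum_{i=1}^{n} x_i x_i^T \Bigr) \quad \text{and} \quad W \sim N(0, \sigma^2 C).$$

We also have

$$\lambda_n^{(1)} \sum_{j=1}^{p} \bigl\{ |\beta_j + u_j/\sqrt{n}| - |\beta_j| \bigr\} \to \lambda_0^{(1)} \sum_{j=1}^{p} \bigl[ u_j \,\mathrm{sgn}(\beta_j) I(\beta_j \ne 0) + |u_j| I(\beta_j = 0) \bigr]$$

and

$$\bigl( g(\beta + u/\sqrt{n}) - g(\beta) \bigr)^T \lambda_n^{(2)} \to \Bigl( \frac{\partial g(\beta)}{\partial \beta^T}\, u \Bigr)^T \lambda_0^{(2)} .$$

Thus $V_n(u) \to_d V(u)$, with finite dimensional convergence holding trivially, where

$$V(u) = -2u^T W + u^T C u + \lambda_0^{(1)} \sum_{j=1}^{p} \bigl[ u_j \,\mathrm{sgn}(\beta_j) I(\beta_j \ne 0) + |u_j| I(\beta_j = 0) \bigr] + (\lambda_0^{(2)})^T \frac{\partial g(\beta)}{\partial \beta^T}\, u .$$

Since $V_n$ is convex and $V$ has a unique minimum, it follows (Geyer 1996) that

$$\arg\min_{u} V_n(u) = \sqrt{n}\,(\hat\beta_n - \beta) \to_d \arg\min_{u} V(u).$$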

Theorem 2. The procedure incorporating prior constraint information into the lasso increases the
efficiency of selecting significant variables for the response compared with the traditional lasso
procedure.

Proof. The general lasso procedure is

$$\tilde\beta_s = \arg\min_{\beta} \Bigl\{ \sum_{i=1}^{n} (y_i - x_i^T\beta)^2 \ \text{ subject to } \ \sum_{h=1}^{p} |\beta_h| \le s \Bigr\},$$

where GCV or CV is used to choose the tuning parameter $\tilde s$ that minimizes the estimated
prediction error,

$$\tilde s = \arg\min_{s} PE_s ,$$

with $PE_s$ now computed from the unconstrained lasso fits. If the estimator $\tilde\beta_{\tilde s}$ satisfies the prior
constraints $g(\tilde\beta_{\tilde s}) \le 0$, then $\tilde\beta_{\tilde s}$ clearly also minimizes the
estimated prediction error over the narrower region $g(\beta) \le 0$. That is,

$$\tilde\beta_{\tilde s} = \hat\beta_{\hat s},$$

where $\hat\beta_{\hat s}$ is the estimator obtained by the lasso procedure incorporating prior constraint
information in (1). Now take Figure 1 as an example. From Figure 1, $\tilde\beta_{\tilde s}$ lies in the
region ABCD and minimizes the estimated prediction error there, whereas $\hat\beta_{\hat s}$ lies in the
region AEFD. It is clear that $\hat\beta_{\hat s}$ minimizes the estimated prediction error in the region above the
line EF. Furthermore, the true model also lies in the region above the line EF. So if
$\tilde\beta_{\tilde s}$ selects the true variables correctly, that is, the nonzero components of $\tilde\beta_{\tilde s}$ are exactly the significant
covariates, then $\hat\beta_{\hat s}$ also selects the true variables correctly.

If $\tilde\beta_{\tilde s}$ does not select the significant variables correctly, the prior constraint information may bring
us into a narrower region in which to select these variables again. This increases the efficiency of variable
selection.

4. Standard error and degrees of freedom of the lasso estimate 

Since our lasso procedure combining sample and prior constraint information is a nonlinear
and nondifferentiable function of the response values even for a fixed value of $s$, it is difficult to
obtain an accurate estimate of its standard error. The problem can be addressed by a bootstrap approach:
either $s$ is fixed, or we optimize over $s$ for each bootstrap sample.

Efron et al. (2004) consider a definition of degrees of freedom using the formula of Stein
(1981):

$$df(h(y)) = \sum_{i=1}^{n} \mathrm{cov}\bigl(h_i(y), y_i\bigr) = E\Bigl\{ \sum_{i=1}^{n} \frac{\partial h_i(y)}{\partial y_i} \Bigr\},$$

where $y = (y_1, \ldots, y_n)^T$ is a multivariate normal vector with mean $\mu$ and covariance $I$, and $h(y)$
is an estimator, an almost differentiable function from $\mathbb{R}^n$ to $\mathbb{R}^n$. For the lasso with orthonormal
design $X^TX = I_{p \times p}$, the degrees of freedom are the number of non-zero coefficients. Tibshirani et
al. (2005) show that the natural estimate of the degrees of freedom of the fused lasso is

$$\begin{aligned}
df(\hat y) &= \#\{\text{non-zero coefficient blocks in } \hat\beta\} \\
&= p - \#\{\beta_j = 0\} - \#\{\beta_j - \beta_{j-1} = 0,\ \beta_j, \beta_{j-1} \ne 0\}.
\end{aligned}$$

Similarly, the natural estimate of the degrees of freedom of the lasso incorporating prior constraint
information is

$$df(\hat y) = p - \#\{\beta_j = 0\} - \#\{g(\beta) = 0\}.$$

This estimate of the degrees of freedom can be used in a BIC-type tuning parameter selector.
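
The degrees-of-freedom estimate for the constrained lasso amounts to counting zero coefficients and active constraints, which the following sketch makes explicit. It is an illustration under assumptions: the function name `degrees_of_freedom`, the tolerance `tol` (needed because a numerical solver rarely returns exact zeros) and the example fit are introduced here and are not part of the original procedure.

```python
def degrees_of_freedom(beta, g, tol=1e-8):
    """Estimate df = p - #{beta_j = 0} - #{g_k(beta) = 0}.

    Values within `tol` of zero are counted as zero, mirroring the tolerance
    used when a quadratic programming solver returns near-zero coefficients.
    """
    p = len(beta)
    zero_coefs = sum(1 for b in beta if abs(b) <= tol)
    active_constraints = sum(1 for c in g(beta) if abs(c) <= tol)
    return p - zero_coefs - active_constraints


# Hypothetical fit with p = 4 under the simple order beta_1 <= beta_2 <= beta_3 <= beta_4,
# written as g_k(beta) = beta_k - beta_{k+1} <= 0 for k = 1, 2, 3.
beta_fit = [0.0, 0.7, 0.7, 1.2]
simple_order = lambda b: [b[k] - b[k + 1] for k in range(len(b) - 1)]
df = degrees_of_freedom(beta_fit, simple_order)
```

In this example one coefficient is zero and one order constraint is active ($\beta_2 = \beta_3$), so the estimate is $4 - 1 - 1 = 2$.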

5. Some Examples 

In the following, we give three examples for illustration of the proposed procedure's practical 
applications in many models. 

Example 1: linear inequality constraints in linear models

Wolak (1989) and Silvapulle and Sen (2005, p. 9) consider the following double-log demand
function:

$$\log Q_t = \alpha + \beta_1 \log PE_t + \beta_2 \log PG_t + \beta_3 \log I_t + \gamma_1 D1_t + \gamma_2 D2_t + \gamma_3 D3_t + \varepsilon_t ,$$

which is a linear model, where

$Q_t$ = aggregate electricity demand,

$PE_t$ = average price of electricity to the residential sector,

$PG_t$ = price of natural gas to the residential sector,

$I_t$ = income per capita,

and $D1_t$, $D2_t$, $D3_t$ are seasonal dummy variables.

Prior knowledge suggests that

$$\beta_1 \le 0, \quad \beta_2 \ge 0, \quad \beta_3 \ge 0,$$

which are linear inequality constraints. A typical model selection question is whether or not the
foregoing model provides a better fit than the simpler model

$$\log Q_t = \alpha + \gamma_1 D1_t + \gamma_2 D2_t + \gamma_3 D3_t + \varepsilon_t .$$

Wolak (1989) and Wang et al. (2007b) discuss the model selection problem by a test method and by
a variable selection method, respectively.

Example 2: nonlinear inequality constraints in linear models 

Dufour (1989) considers the following econometric model:

$$y_i = f(x_i, \beta) + \varepsilon_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i2}^2 + \beta_5 x_{i3}^2 + 2\beta_6 x_{i2} x_{i3} + \varepsilon_i .$$

This could be a production function or a unit cost function, where $y_i$ is the production or unit cost
and $\{x_{i2}, x_{i3}\}$ are inputs. A problem of interest in econometrics is whether $f(x_i, \beta)$ is concave in $x_i$,
which can be expressed by the following nonlinear inequality constraints:

$$\beta_4 \le 0, \quad \beta_5 \le 0, \quad \beta_4 \beta_5 - \beta_6^2 \ge 0 .$$

Dufour (1989) discusses the model selection problem by a test method.
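
To show how such nonlinear constraints fit the generic form $g(\beta) \le 0$, the concavity conditions $\beta_4 \le 0$, $\beta_5 \le 0$ and $\beta_4\beta_5 - \beta_6^2 \ge 0$ can be coded as a vector-valued function. This is a small illustrative sketch; the names `concavity_constraints` and `is_feasible` and the numerical values are hypothetical.

```python
def concavity_constraints(beta):
    """Return g(beta) for Example 2; the constraints hold when every entry is <= 0.

    beta = (beta_1, ..., beta_6). Concavity of f in (x_2, x_3) requires
    beta_4 <= 0, beta_5 <= 0 and beta_4 * beta_5 - beta_6**2 >= 0; the last
    condition is rewritten as beta_6**2 - beta_4 * beta_5 <= 0.
    """
    b4, b5, b6 = beta[3], beta[4], beta[5]
    return [b4, b5, b6 ** 2 - b4 * b5]


def is_feasible(beta, tol=0.0):
    """True when all components of g(beta) are <= tol."""
    return all(c <= tol for c in concavity_constraints(beta))


# Hypothetical coefficient vectors: the first is concave, the second is not.
concave_beta = [1.0, 0.5, 0.5, -1.0, -2.0, 1.0]
non_concave_beta = [1.0, 0.5, 0.5, -1.0, -2.0, 2.0]
```

A function of this form can be passed directly as the constraint $g$ to a nonlinear programming solver or to a feasibility filter for Monte Carlo draws.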

Example 3: linear equality and inequality constraints in generalized linear models 

An assay was carried out with the bacterium E. coli strain 343/358(+) to evaluate the genotoxic
effects of 9-aminoacridine (9-AA) and potassium chromate (KCr). Piegorsch (1990) and Silvapulle
(1994) consider the following log-linear model

$$\log(1 - \pi_{ij}) = \mu + \alpha_i + \tau_j + \eta_{ij} \tag{7}$$

to evaluate whether potassium chromate and 9-AA have a synergistic effect, where $i = 1, 2$, $j = 1, \ldots, 5$ and

$$\pi_{ij} = \Pr\{\text{positive response for a test unit in cell } (i, j)\}.$$

In fact, the log-linear model (7) is a binary regression model of the same kind as logistic regression, and is
therefore one of the generalized linear models (GLM). To ensure that the parameters in (7) are identified,
Piegorsch (1990) and Silvapulle (1994) impose the constraints $\alpha_1 = \tau_1 = \eta_{i1} = \eta_{1j} = 0$ for all $(i, j)$,
together with a system of linear inequality constraints on the interaction parameters
$(\eta_{22}, \eta_{23}, \eta_{24}, \eta_{25})^T$, which means that potassium chromate and 9-AA have a synergistic effect.
The model selection problem is analyzed by a test in Piegorsch (1990), Silvapulle (1994) and
Silvapulle and Sen (2005, p. 161).

6. Discussion 

We have proposed a modified lasso procedure that combines prior constraint and sample information
for variable selection and parameter estimation. The proposed procedure increases the efficiency of
choosing the true model correctly because it carries out variable selection and parameter estimation
in a narrower region in which the true parameters lie. The procedure can be computed by many
quadratic programming methods.

Moreover, the idea of incorporating prior constraint information can be used in other lasso
procedures, e.g. the fused lasso and the modified lasso procedure that allows an adaptive amount of
shrinkage for each regression coefficient.

More work remains to be done. The LARS algorithm of Efron et al. (2004) is an efficient computational
procedure that needs only $p$ steps, but it cannot yet be applied directly to the lasso procedure
incorporating prior constraint information. In our procedure, a Monte Carlo estimator can be used as
the initial estimator. How to extend LARS to the lasso procedure incorporating prior constraint information
is an interesting topic for future study.